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Querying large read collections in main memory: 
a versatile data structure 

Nicolas Philippe^'^, Mikael Salson^'^ Thierry Lecroq^, Martine Leonard^, Therese Commes^ and Eric Rivals^"" 
Abstract 

Background: Higli Tlirougliput Sequencing (HTS) is now lieavily exploited for genome (re-) sequencing, 
nnetagenonnics, epigenonnics, and transcriptomics and requires different, but connputer intensive bioinfornnatic 
analyses. When a reference genonne is available, nnapping reads on it is the first step of this analysis. Read nnapping 
programs owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler 
transform. Recent solutions index both the genome, and the /c-mers of the reads using hash-tables to further 
increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing 
requires to determine the sub-collection of reads that are related to a given sequence, which is done by searching 
for some /c-mers in the reads. Currently, many developments have focused on genome indexing structures for read 
mapping, but the question of read indexing remains broadly unexplored. However, the increase in sequence 
throughput urges for new algorithmic solutions to query large read collections efficiently. 

Results: Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build 
the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is 
repeatedly accessed to answer queries like "given a /c-mer, get the reads containing this /c-mer (once/at least 
once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for 
long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus 
handle larger read collections. We provide examples where such queries are adapted to different types of read 
analysis (SNP detection, assembly, RNA-Seq). 

Conclusions: Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in 
various contexts. The Gk arrays provide a flexible brick to design innovative programs that mine efficiently 
genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under Cecill (GPL 
compliant) license from http://www.atgc-montpellier.fr/ngs/. 



Background 

Next-generation sequencing technologies are presently 
being used to answer key biological questions at the 
scale of the entire genome and with unprecedented 
depth. Whether determining genetic or genomic varia- 
tions, cataloging transcripts and assessing their expres- 
sion levels, identifying DNA-protein interactions or 
chromatin modifications, surveying the species diversity 
in an environmental sample, all these tasks are now 
tackled with High Throughput Sequencing (HTS) and 
require different, but computer intensive bioinformatic 
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analyses. Typically, a recent RNA sequencing experi- 
ment (RNA-Seq) produces about 8 million reads of 75 
base pairs each [1], but both the yield and read length 
will increase [2]. 

Mapping the reads against a reference genome pro- 
vides the genomic positions of mapped reads. For 
instance with RNA-Seq reads, these positions allow to 
know whether a gene is expressed in the studied condi- 
tion. The set of mapped positions represents only part 
of the information needed to analyze the reads, and it 
can be obtained only if a genome is available. Indeed, 
other important information are contained in the read 
collection itself. For instance, to determine the fre- 
quency of haplotypes at a SNP position, one needs to 
align the reads related to this position. These can be 
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obtained by considering for some length k, the /c-mers 
overlapping the SNP and searching for the reads sharing 
this /c-mer. This procedure is applicable even in the 
absence of a reference genome, and similar ones can be 
designed to search for a binding motif in ChlP-Seq 
reads, to determine with RNA-Seq data whether differ- 
ent regions of a messenger RNA sequence are suscepti- 
ble to be differentially expressed, etc. 

For tasks like assembly or read clustering, one needs 
to determine reads overlapping each other or that align 
partly one to another. Numerous works on similarity 
search algorithms have developed seed-and-extend stra- 
tegies and shown that it can be performed efficiently by 
searching common /c-mers between two sequences [3,4]. 

Surely, now and even more in the near future, we will 
need efficient indexing data structures to store and 
query large collections of reads in main memory. Up to 
now, a lot of computational research has been devoted 
to read mapping, and the most efficient tools owe their 
efficiency to the use of involved genome indexing data 
structures, like the Burrows- Wheeler transform [5]. On 
the other hand, the question of read indexing remains 
quite unexplored, although the improvements in 
sequencing throughput suggest that such structures will 
become a compulsory part of future read analysis pro- 
grams. A sign supporting this view: even mapping pro- 
grams now start to index both the genome and the k- 
mers of the reads to boost efficiency and accuracy [6]. 

Numerous works have presented data structures to 
index a single text, like the well known Suffix Tree 
(ST) or the Suffix Array (SA) [7,8]. These enable the 
so-called locate query, that is to locate all occurrences 
of a pattern P either from its sequence or from a posi- 
tion j of occurrence in the text, as well as count query 
to obtain the number of occurrences of P. These struc- 
tures can be adapted to index a set of texts, where each 
text differ from each other; the structures are then 
called generalized Suffix Tree [9], or generalized Suffix 
Array (gSA) [10]. This is done by concatenating all 
texts and adding a separator symbol that does not 
belong to the alphabet {e.g., a $ for the DNA alphabet) 
after each text [9], or directly [10]. Then it requires to 
store the length of each text in an additional array to 
correctly answer locate queries. Such algorithms have 
not been adapted to collections of texts, where two 
texts may be equal in sequence but differ in their iden- 
tifier. The reads obtained from sequencers form a col- 
lection, not a set. 

When the total text is too large, compressed indexes 
reduce the memory needed by storing not all, but only a 
certain proportion of the text positions. Compression is 
obtained by sampling the positions to be stored, while 
non sampled positions need to be recomputed at run 
time. This enables the user to control the balance 



between amount of memory and query time. Hence, 
compression has an impact on the time needed to com- 
pute a query. Ferragina et aL report in a large practical 
evaluation of compressed text indexes, that the query 
time of all tested compressed indexes are between 100 
and 1,000 times slower than with a plain SA for an 
index that is 5 times smaller [11]. The FM-index [5] is 
used to index all chromosomes in mapping applications 
[12]. However, the scalability of neither plain nor com- 
pressed indexes to collections of millions of texts has 
not been investigated so far. We thus address the ques- 
tion of indexing large collections of reads with an 
uncompressed index and compare its performance to a 
generalized suffix array and a hash table. Our structure 
aims to save space compared to those indexes while 
globally retaining queries as fast. Thus we avoid the pit- 
fall of compressed indexes which are less space consum- 
ing but slower by orders of magnitude. 

In this work, we propose a new data structure to 
index reads, an algorithm to build the structure, and 
procedures to query it. Our structure, named Gk 
arrays, is kept in main memory once built and repeat- 
edly accessed to answer different kinds of queries like 
"given a /c-mer, get the reads containing this /c-mer 
(once/at least once)". One can ask both for the /c-mer 
positions or simply for the reads containing it, which 
can prove useful in different applications. We focus on 
cases where millions of queries need to be computed; 
clearly, memory usage will be the key issue. An alter- 
native solution is to adapt some uncompressed index- 
ing structures designed for long texts (suffix tree or 
suffix array [9,13]). We compare Gk arrays to such an 
alternative and show experimentally that they process 
queries fast, while requiring much less memory 
(between 2/3 and 1/3 of a suffix array solution). We 
also perform experimental comparisons against a 
method using hash table: it shows that while the hash 
table method can answer quickly to queries it does not 
scale to large collections of reads. 

If in biology the term /c-mer is preferred, computer 
scientists rather use the equivalent words of /c-factor or 
/c-substring; we will stick to the term /c-mer. The Gk 
arrays allow to answer queries related to an input k- 
mer; let us call these /c-mer queries. Before entering the 
algorithms description, we list below the applications of 
/c-mer queries in the analysis of High Throughput 
Sequencing data. The Results section will first present 
our data structure, its construction algorithm and the 
procedures to answer /c-mer queries, then detail the 
experimental comparisons. 

Finally, we discuss the advantages of our structure and 
conclude with future developments. 

Note that this study does not tackle the question of 
read mapping, it focuses on read indexing. 
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Queries and Applications 

Let us give an informal presentation of the problem. We 
are given a collection of q reads of length m and a 
length of substring k such that k < m. 

Suppose one is given a string / of length k\ one does 
not know whether it appears in some of the reads or 
not [i,e,, whether /is a substring of some read). In the 
Algorithm section, we describe a data structure in which 
all substrings of length k of the reads are ordered lexico- 
graphically. Hence, one can search for /using a dichoto- 
mic search in 0(/c log((m - /c + \)q))) worst case time in 
this structure (the dichotomic search is the standard 
procedure in this context [8,9]), and determine whether 
at least one read contains /as a substring and at which 
position. If not, the answers to the queries below, which 
are all related to a sub-collection of reads containing / 
are trivially the empty set or zero. Otherwise, one 
knows that / occurs in some read r of the collection at 
position and wishes to get some information on the 
other reads where /occurs. One wants to answer the 
following questions: 

Ql: In which reads does /occur? 

Q2: In how many reads does /occur? 

Q3: What are the occurrence positions of /in the 
reads? 

Q4: What is the number of occurrences of/ in the 
reads? 

Q5: In which reads does /occur only once? 

Q6: In how many reads does /occur only once? 

Q7: What are the occurrence positions of/ in each 
read where /occurs only once? 

Q8: What is the number of occurrences of/ in the 
reads where it occurs only once? 

We state several remarks about the queries before 
dwelling on applications. 

1. The queries go by pairs: the first one computes a 
set of positions or read indices, while the second 
computes the cardinality of that set. 

2. Note the clear semantic difference between Ql/ 
Q2 and Q3/Q4. The answer to Ql yields the identi- 
fiers of the reads in which / occurs, while that to Q3 
gives also all its positions in the read. This clearly 
differs since /may occur several times in a read [e,g,, 
if/ is a poly-A sequence). Sometimes the positions 
are needed, sometimes only the reads (see below). 

3. Queries Q5-Q8 are versions of Q1-Q4 constrained 
to a single occurrence of/ in the reads. Of course 
other variants can also be computed, e.g. where the 
number of occurrences is limited by a user defined 
threshold. Since /is constrained to occur only once 
in each read, Q6 and Q8 are equivalent, and we will 
mention only Q6 in the sequel. 



4. The data structure we propose is intended to be 
kept in memory and used for multiple queries. 

Although this paper focuses on the data structure, its 
efficiency, and on the algorithms to solve these type of 
queries, it is important to list applications of these 
queries. In which context of read analysis, can one use 
such queries? Note that in such context, k is smaller 
than the read length. Theoretical and empirical investi- 
gations show that for instance, with k > 19 or 20, k- 
mers indicate in average a single genomic location in 
the human genome [14]. Such values of k can be com- 
puted depending on the genome length. Translated to 
reads or sequences: it is unlikely that two reads sharing 
a /c-mer were not sequenced from the same part of the 
DNA. In other words, sharing a /c-mer is a witness for 
having a common genomic origin. 
Mutation detection 

Putative mutations (SNP, somatic mutations, small 
indels) are indicated by differences between a read and a 
reference genome. Once the reads have been mapped to 
the reference genome, one analyzes the sub-collection of 
reads that covers a genomic position to count how 
many reads support the variation observed in the read 
or that observed in the genome. If one considers the 
two substrings of length k centered on this mutation 
position, one in the read and one in the genome, 
answering Q2 for these substrings will give an approxi- 
mate count of these two haplotypes. If one needs the 
corresponding reads, then Ql is the appropriate query. 
If only a single, or a few reads, share this /c-mer, then a 
sequence error might be suspected [15]. 
Local coverage 

Suppose one is given a target sequence, which can be a 
read or an external sequence. For each of its /c-mer, let 
us call the local coverage, the number of reads sharing 
this /c-mer (this requires a dichotomic search). The local 
coverage profile (/.e. a histogram of the local coverage) 
along the target sequence provides useful information in 
various contexts. For a known mRNA and an RNA-Seq 
experiment, the average local coverage on all /c-mers is a 
proxy for the expression level of the target, while the 
profile enables one to distinguish the target's sub- 
regions expressed at different levels [16,17]. In another 
context, with a genomic library, taking reads as queries 
and looking at their local coverage profile may help to 
detect those overlapping the extremity of a repeated or 
transposable element. This may prove useful to study 
the distribution and evolution of these elements in the 
genome. 

Clustering and assembly without a reference genome 

As for Expressed Sequence Tags, it is suitable to cluster 
and assemble RNA-Seq reads to compute the various 
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transcripts expressed in the assayed library [16,17]. It is 
necessary to detect near exact alignment between pair 
of reads, and this is usually performed efficiently by fil- 
tration using seeds. In such case, very efficient and sen- 
sible seeds are exact shared /c-mers [4]. Here, the sub- 
collection of reads sharing a /c-mer with a given read, as 
well as the /c-mer positions, can be obtained using query 
Q3. The answer to Q4 can help guiding the clustering 
process. 

Similar needs of query occur in the assembly of geno- 
mic reads [18,19]. To know with which reads one can 
assemble a given read without ambiguity, one may per- 
form query Q7 using /c-mers at the 5' or 3' extremities 
of the read. The obtained occurrences together with 
their positions will indicate the matching reads and the 
relative positions of read pairs for assembly. 

Our application list provides examples and is by no 
means exhaustive. We could also mention for instance 
the estimation of the target genome length in assembly 
context, which uses /c-mer counting [20]. Clearly, these 
applications are beyond the scope of this paper. How- 
ever, these paragraphs underline that the proposed data 
structure suits the needs of read processing in various 
application contexts, and will provide a unified frame- 
work for building read analysis programs. 

Results and Discussion 

This section contains the main contribution: a data 
structure to index large read collections, the Gk arrays. 
To describe it, we first introduce the notation, formalize 
the queries, exhibit the index data structure, give its 
construction algorithm, and the procedures for answer- 
ing all queries. This makes the content of the Algo- 
rithms section. Then, in the Comparison section we 
investigate its practical usability compared to two alter- 
natives: one based on a generalized Suffix Array (SA) 
and another based on a hash table. This includes theore- 
tical and practical comparisons. 

Algorithms 

Here, we detail the algorithms to build the Gk arrays 
and to answer the queries. We start by defining more 
formally the queries we want to answer and introduce 
the necessary notation. 
Notation and definition of the queries 
Let S be an alphabet of size a. S'' denotes the set of 
words, strings or sequences over S and, for any integer n, 
denotes the set of words of length n over S. For a 
word X, \x\ denotes the length of x. Given two words x 
and J, we denote by xy the concatenation of x and y. 
For every 0 < i < J < \x\ - 1, x[i] denotes the (/ + l)^'^ 
element of x, and x[i., j] denotes the substring x[i]x[i + 
1] . . . x[j]. Let <L denotes the comparison operator for 
the lexicographic order on words. Lexicographic ranks 



start from zero and all arrays are indexed from zero. For 
any finite set A, we denote its cardinality by #A, 

The input consists a list R = (ro, . . ., r^_i) of q short 
sequences of length m, called reads, which are not 
necessarily distinct. We know that m, k, q ^ N satisfy m 
> k>0, 

A /c-long substring of a word is called a k-mer. For any 
u G we denote by Fj^{u) the set of /c-mers in u: F/^u) 
= {v G I 3p G [0, \u\ - k] such that v = u[p. . p + k - 
1]}. Let/e and let us denote the set of indexes of 
the reads in which /occurs by IndjJJ) = {/ e [0, <5^[| / e 
Fj^irj)}, and the set of positioned occurrences of /in all 
reads by Pos^if) = {(;,^) | ry[^. . ^ + /c - 1] = /, where a 
positioned occurrence is given by the pair made of the 
read index in R and the beginning position of/ in this 
read. Let us denote the restriction of Indj^(f) (resp. Posj^ 
(/)) to subset of read indexes where /occurs only once 
by UIndk(f) (resp. UPosj^if)). Formally, UPosj,(f) = {(/', £) \ 
rj[e, ,e + k'l] =/and V/ ^ rj[i, , i + k - 1] ^ fi, and 
UInd,{f) = {; I (/•, £) G UPos,{f)l Let / g [0, ql J" g [0, 
m ' k + 1[, and let /be the /c-mer starting at /' in read 
r^. Note that here we require the knowledge of the pair 
/')) which defines the /c-mer / Now, the seven /c-mer 
queries can be formally defined as computing 

Q 1 : the set Jn4 (f) Q2 : #Jn4 (/) , the cardinality of Jn4 (f) 
Q3: the setPo5fe(/) Q4: #PoSk(f ), the caradinality of PoSk{f) 
Q5: the set Ulndkif) Q6: #UIndk{f), the cardinality of L7Jn4(/) 
Q7:the set UPosuif) 

Clearly, it appears (see Additional File 1: Proof and 
queries' algorithms) that the algorithms to compute 
Ulndjjj), resp. UPoSf^if), for answering Q5/Q7, simply fil- 
ter Indk(f), resp. Pos/^(f), on the fly, and are thus similar 
to the algorithms for Q1/Q3. For place sake, we will 
only detail the solutions for Q1-Q4 in the sequel. 
The index structure 

Our algorithm relies on four arrays that allow to query 
the /c-mers of all reads. Hence, we define a word made 
of the concatenation of all reads: Cj^ = rori ... r^_i. Of 
course, a /c-mer that overlaps two reads in Cj^ is not 
necessarily a k-mer of some read. Hence, we introduce a 
system to renumber the positions of interest in C/^. The 
rationale behind is to save place in the Gk arrays by dis- 
carding the positions of overlapping /c-mers in Cj^, Let 
us denote by f the number of distinct /c-mers of all 
reads, and for the sake of legibility we set m := m — + 1 
and q := qm = q{m — k+l) (the number of interesting 
positions in a read and in Cr, respectively). We call: 

♦ P'position, a starting position in Cr of a /c-mer that 
is not overlapping two reads, ix, an element of 
^pos := [O, qm[ \ [j \ (j mod m) > m}. 

♦ g, the function that renumbers P-positions in order 
such that their index are consecutive; g is defined by: 
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S '■ ^pos ^ Qpos 

j \-> \ij mj X m + (j mod m) . 



♦ Q'positioriy an image of a P-position by g(.), i^e, an 
element of Qpos := ^(^pos) = l^f^l Note that the set 
Qpos is not a query. 

Clearly, Pp^^ and Qpos have the same cardinality and 
as (y ^ /) implies ^(y) ;^ g{f)y g is bijective. Hence, g'^ 
exists and maps a Q-position back to its corresponding 
P-position in Cj^, Proposition 1 explicits the conversion 
between a positioned occurrence and a P-position. 

Proposition 1, Let (/, £) with j g [0,q[, i e [0,m[ be a 
positioned occurrence of a k-mer in a read. The corre- 
sponding P -position in Cr is jm+t. Conversely, let f be a 
P-position, the corresponding positioned occurrence in a 
read is [\_flm\, f mod m). 

This numbering system is important for it allows us to 
go back and forth between a positioned occurrence in a 
read, its corresponding P-position in and its Q-posi- 
tion that will be stored in our arrays. 

Let y be a Q-position. We denote by 5q(/), resp./Q(/), 
the suffix, resp. the /c-mer, of Cj^ beginning at the P- 
position ^'^(/'), i.e. Sqij) = Cj^[g'^(j) . . qm - 1] and/Q(/') = 
CR[g'^{j) . . g\j) + k - l].We call 5q(/') a P-suffix. Note 
that all suffixes beginning at P-positions have different 
length and are pairwise distinct; thus, there are q such 
suffixes and they all have a different lexicographic rank. 
However, this may, and in real data applications will. 



not be the case for the /c-mers, i.e. the /q(/). We call the 
set {/q(/) I y G Qpos} the set of Pj^-factors, whose cardin- 
ality is r with our notation. 
Now, we define the Gk arrays: 

GkSA (Generalized k Suffix Array) is a modified Suffix 
Array of Cr that lexicographically sorts only the P- 
suffixes, 

GklFA (Generalized k Inverse Factor Array) is a modi- 
fied Inverse Suffix Array (ISA) that stores for each Q- 
position, in position order in Q^, the lexicographic rank 
of the Py^-factors starting at the corresponding P- 
position, 

GkCFA (Generalized k Counting Factor Array) is an 
array that associates to a /c-mer (actually, to its rank) its 
number of occurrences at P-positions in C/^, 

GkCFPS (Generalized k Counting Factor Prefix Sum) 
stores the prefix sums of GkCFA. Since GkCFA and 
GkCFPS are equivalent only one of them is necessary at 
a time. 

Formally, the definitions are (see Figure 1 for an 
example and Figure 2): 

♦ For / a suffix lexicographic rank and y a Q-position 
{i.e. i, j G Qpos), 

GkSA[i] = j iff Sqij) has lexicographic rank / 
among the P-suffixes. 

♦ For / a /c-mer lexicographic rank and y a Q-position 
{i.e. i G [0,f[ andy G Qpos), 

GkIFA[j] = / iff /Q(y) has lexicographic rank / 
among the P/^-factors. 

♦ For / a /c-mer lexicographic rank {i.e. i g [0, f[). 
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Figure 1 Example of the read index data structure: tiie G/c-arrays. Example of the read index data structure: the Gk arrays. Example for a 
collection R = (aacaact, caattca, aacaagc) of q = 3 reads of length m = 7 and considering 3-mers {k = 3). The index is composed of 
three tables and uses a fourth one during construction {GkSA, GklFA, GkCFA, and GkCFPS). The first table shows the starting indices of /c-mers in 
the text made by the concatenation of all reads, Q, the SA built on Cr, and the function g that renumbers P-positions of Q to make them 
consecutive. P-positions are {0,1,2,3,4,7,8,9,10,11,14,15,16,17,18}; all other positions, those starting positions where the /c-mer overlaps two reads, 
are displayed with a gray background (lines j and SA[/]). Line SA refers to the usual Suffix Array of Cr. The /c-mer caa occurs 4 times in Cr at 
positions 2, 7, 12 and 16. Among those, only 2, 7, and 16 are P-positions. The lexicographic rank of the P^-factor starting at position 16 is given 
by GklFA[g{]6)] = GklFAM = 7, and the number of occurrences of the P^-factor caa is given by GkCFA? , which equals 3. The positions of these 
occurrences are thus obtained by the set [g\GkSA[l]) \ GkCFPS[7 - 1] <) < GkCFPS?} = {2,7,16}. See also Figure 2. 
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Figure 2 Accessing the occurrences of a /c-mer in the index. Accessing the index to get the occurrences of a /c-mer starting at position j in 
the concatenation of the reads [i.e., Q. Accessing GklFA, GkCFPS and GkSA: (a) From Cr to GklFA: gij) is the renumbered position of the P- 
position / (b) From GklFA to GkCFPS: GklFA[g{j)] is the lexicographic ranl< of the P^-factor starting at P-position j in Q, and GkCFPS[GklFA[g{j)]] is the 
number of occurrences in Cr of the Prfactors of ranl< less than GklFAigij)]. (c) From GkCFPS to GkSA: The positions of the occurrences of the P^- 
factor starting at position j are in GkSA in the range [GkCFPS[GklFA[g{j)] -1], GkCFPSiGklFAigij)]]]. 



GkCFAli] = #{/-£ Qpo, |/q(/-) =/q {GkSAim 
* For / a /c-mer lexicographic rank {Le, i e [0, f[), the 
definition of the prefix sum is 
GkCFPS\i] = ^j^o GkCFA[t] and 
GkCFPS['l] = 0. 

Remark 1. The array GkCFPS is not essential to the 
algorithm: it is solely there to avoid multiple, time con- 
suming computations of prefix sums over GkGFA (see 
GkGFPS definition above). Moreover, any value of 
GkGFA can also be accessed in constant time using 
GkGFA[i\ = GkGFPS[i] -GkGFPSU -1]. Thus, GkGFPS 
will be kept in memory to replace GkGFA, 

We give some useful properties of Gk arrays. 

Proposition 2. For i e [0,f[, GkGFPS[i] = #{j g Qp^s \ 
foij) ^LfQ{GkSA[i])} (Proof by induction). 

In other words, GkGFPS[i] is the number of P/^-factors 
having lexicographic rank less than or equal to /. Since 
GkSA is sorted on the lexicographic order of the P-suf- 
fixes, it is also sorted on the lexicographic order of the 
P/^-factors. Hence, we get: 

Proposition 3. Let /e such that Ind^if) ^ 0. All 
occurrences of f have the same rank among the Pj^-fac- 
tors, and are stored consecutively in GkSA. 
Construction algorithm 

First, we detail the algorithm for building GkSA, and 
then the one computing GklFA and GkGFA. 
Computation of GkSA We first build the full Suffix 
Array (SA) of Cj^ using a linear time and space algo- 
rithm. Since \Cp\ = mq this first step can be done in O 
(mq). Then GkSA is obtained from SA by selecting only 
the P-positions and by renumbering them to Q-positions 
using function g. This second step is performed in O 



{mq) time and space. Moreover, GkSA is built in place 
of the Suffix Array: our algorithm allocates only the 
memory for the SA table. When answering Q1/Q2, each 
read where a given P/,-factor occurs should be counted 
only once (even if the P/^-factor occurs more than once 
in the read). Similarly, for Q5/Q6, we count only reads 
where a given Py^-factor occurs exactly once. To avoid 
using masks on the reads, we sort in increasing order 
the values of GkSA corresponding to P/^-factors sharing 
the same lexicographic rank (see Table in Additional 
File 1: Proof and queries' algorithms). The values that 
have to be sorted are Q-positions, i.e. integers, thus the 
sort can be performed in linear time on values of GkSA 
using e.g. radix sort [21]. The whole process takes O 
{mq) time and space. 

Computation of GklFA and GkCFA Algorithm 1 
shows how to compute jointly GklFA and GkCFA. Its 
correctness proof is given in Additional File 1: Proof 
and queries' algorithms. 

Algorithm 1: Computation of GklFA and GkCFA. 

Data: GkSA, Cr, k, q 

Result: GklFA and GkCFA 

1 begin 

2 GkIFA[GkSA[0]] <r- 0; 

3 GkCFA[0] <r- 1; 

4 t<^ 0; 

5 foreach i g [l,q[ do 

6 ; ^ GkSA[i]; 

7 f <r- GkSA[i - 1]; 

8 if/Q(/') ^/Q(/")then 

9 t<^ t + I; 

10 GkCFA[t] ^ 0; 

11 GklFAlj] ^ t ; 
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12 GkCFA[t] ^ GkCFA[t] + 1; 

13 return {GklFA and GkCFA); 

Theorem 1. Algorithm 1 correctly computes the arrays 
GklFA and GkCFA, (Proof in Additional File 1; Proof 
and queries' algorithms). 

The comparison between two Py^-factors (line 8) is 
naively performed in 0{k) time, and is the only instruc- 
tion of the inner loop that takes more than constant 
time. Hence, the computation of both GklFA and 
GkCFA is performed in 0((m - k)qk) time. Let us 
emphasize the simplicity of the algorithm, which 
explains the fast construction times obtained in practice. 

Remark 2. Once the values of GkCFA have been cal- 
culated, one can compute the values of GkCFPS in-place 
in 0{{m ' k)q) time (see Remark 1), 
Answering the queries 

Assume the Gk arrays have been built in a preprocessing 
step (see section Construction algorithm); we show how 
to answer the first four queries, starting with Q4 and 
Q3. Let i e [0,qlf e [0,m[, and let / be the k-mer 
starting at y" in read This occurrence of/ in Cj^ is 
found at P-position /:= im + y" and the corresponding 
Q-position is y := g(J'), 

Q4: Computing the cardinality of PoSk(f) First, we 
need to find the lexicographic rank of/ among the Pj^- 
factors, which we obtain directly by setting t := GkIFA[j] 
(by definition of GklFA), The cardinality of Posj^(f) is 
simply the number of occurrences starting at P-positions 
in Cr, which is given by GkCFA[t] (by definition of 
GkCFA), By Remark 1, GkCFA[t] = GkCFPS[t] - 
GkCFPS[t - 1], 

Q3: Computing Posj^(f) By Proposition 3, all occur- 
rences of / starting at P-positions are stored consecu- 
tively in GkSA, It suffices to find the lower and upper 
indices, denoted by £f and Uf respectively. By the order- 
ing of GkSA all occurrences of factors smaller than / in 
the lexicographic order are stored before its occurrences 
in GkSA, Hence, by definition of GkCFPS and Proposi- 
tion 2, we have Uf = GkCFPS[t] and if = GkCFPS[t - 1]. 
Since GkSA is indexed from 0, the starting Q-positions 
of occurrences of/ are comprised in the range [^^ Uf] 
in GkSA, The corresponding P-positions are obtained 
using ^'^(.) and are then transformed into positioned 
occurrences with Proposition 1. This proves Theorem 2. 

Theorem 2. Let f be a k-mer of a read occurring at Q- 
position j in Cr, Then, its lexicographic rank among the 
Pj^-f actors is t := GkIFA[j], If we set Uf:= GkCFPS[t] and 
£f:= GkCFPS[t- 1] then 

1, the starting P-positions offs occurrences in Cr are 
{g\GkSAm\t^ [tj,Uf[l 

2, PoSk(f) = {{[g-\GkSA[i])/m\, g\GkSA[£]) mod 
m)\£G [£j, Uf [}, 



3, #Pos,,if) = Uf - If 

Given Theorem 2, the queries regarding Indjjj) can be 
answered as follows: 
Ql: Indj,(f): = {[g-\GkSA[i])lm\\ £ g [if u^l 
Q2: by counting the elements of Indjjf) while comput- 
ing it. 

The algorithms for Ql, Q3, and Q4 are given exten- 
sively in Algorithms 2, 3, and 4. The algorithms for all 
other queries are included in Additional File 1: Proof 
and queries' algorithms. 

To answer Q7, one computes Posj^(f) and scans it on 
the fly to remove reads (or the positioned occurrences) 
having strictly more than one occurrence of / A similar 
approach solves Q8, and Q5. Variants of these queries 
where the number of allowed occurrences is constrained 
by a parameter can be answered similarly. 
Complexity Answering Q1-Q3 or Q5-Q8 requires to 
scan the values in GkSA inside the range corresponding 
to the /c-mer / which can be performed in 0{occ_Reads 
(f)) time, where occ_Reads(f) denotes the occurrence 
number of/ in the reads. Query Q4 is computed in con- 
stant time using GkCFPS, 

Algorithm 2: Ql {Ind/,(f)) 

Data:/G L\ j g Ppos such that Q[/ .. y + /c - 1] =/ 
Result: The set IndjJj) 

1 begin 

2 Indj^ <r- empty set; 

3 t^GklFAlj]; 

4 if^ GkCFPS[t - I]; 

5 Uf<r- GkCFPS[t]; 

6 prev < — 1; 

7 foreach / g [£f Uf [ do 

8 readlndex <r- [g-^ [GkSA\i])/m\, 

9 if readlndex ^ prev then 

10 Add readlndex to Indj^, prev ^ readlndex; 

11 return {Indj^; 
Algorithm 3: Q3 {Pos^if)) 

Data:/G y g Ppos such that CrU , , j + k - I] =f 
Result: The set Po5^(/) 

1 begin 

2 PoSf^ <r- empty set; 

3 t^GklFAlj]; 

4 if^ GkCFPS[t - 1]; 

5 Uf<r- GkCFPS[t]; 

6 foreach / g [£f Uf [ do 

7 readlndex ^ [g-^ [GkSA\i])/m\, 

8 posInRead <r- g'^iGkSAli]) mod m; 

9 Add the pair (readlndex, posInRead) to Posi^; 

10 return {Pos/^); 

Algorithm 4: Q4 (The cardinality of Pos^if)) 
Data:/G y g Ppos such that CrIj . . y + /c - 1] =/ 
Result: The cardinality of Posj^if) 
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1 begin //GkCFA[t] = GkCFPS[t] 'GkCFPS[t - 1] 

2 t^GkIFA\j]) 

3 return {GkCFA[t\y, 

Practical considerations: implementation and variable read 
length 

The value of /c, which determines the length of /c-mers 
used for querying the collection of reads, is a para- 
meter of our index. However, the Gk arrays remain 
flexible. If for the simplicity of the presentation we 
have assumed until now that all reads have the same 
length, the whole structure can be adapted to a collec- 
tion of reads having variable length. Indeed, since 
some sequencing technologies produce variable-length 
reads [e.g. Roche 454®), this adaptation is an important 
issue of versatility. 

Indexing variable-length reads We show how our 
method can be slightly adapted to tackle this problem. 
Remind that the Gk arrays consider the string Cr, the 
concatenation of all reads, and save place by discarding 
positions at which a /c-mer overlaps two reads. This was 
done efficiently by converting any read position, or im- 
position, into a Q-position, and conversely, using func- 
tion g. Up to now, this function relies on the fact that 
the read length is fixed. Thus, we need to modify its 
definition to accommodate different read lengths. For 
this, we use a bit vector F, as long as Cj^, to record 
which positions in Cr are P-positions: j is a P-position 
iff F[j] = 1. We implement it as a vector having rank 
and select capabilities [22,23]. We define these opera- 
tions as 

♦ ranki(F, /) is the number of ones in F[0..i], 

♦ selecti(F, /) is the position of the /-th one in F (or | 
F\ if there is less than / ones in F), 

These operations can be performed in constant time, 
and F can be stored in a compressed form needing only 
\F\Hq{F) + o{\F\) bits, where Hq is the zero-th order 
empirical entropy of F, Then computing g{j) and g'^ij) 
can be easily performed with a single rank or select 
query. Indeed, we have g{j) = ranki(F, y), and g'^ij) = 
select y). Finally, using little extra memory, Gk arrays 
can also handle variable-length reads. 
Implementation Gk arrays are available as a reusable C 
+-1- library under a Cecill C licence (GPL compliant). It 
accepts standard formats for the input read collection 
(FASTA, FASTQ). Depending on the number of /c-mer 
positions, the user should turn on the 64 bit encoding 
at compilation. It allows to process data sets of more 
than 2^^ positions. Default is set to 32 bit encoding. 
Another compilation option can be activated to handle 
variable-length reads (typically Roche 454® datasets), 
otherwise by default Gk arrays process fixed length 
reads. 



The data structure construction and queries algo- 
rithms are coded in standard C and C+ + . To reduce 
memory consumption, the full SA of Cr is built using 
libdivsufsort library https://code.google.com/p/libdivsuf- 
sort/, which was chosen for its efficiency and low mem- 
ory usage (see https://code.google.eom/p/libdivsufsort/ 
wiki/SACA_Benchmarks for a benchmark of up-to-date 
SA construction algorithms). However, its worst case 
time complexity is not linear in the length of the input 
sequence. Also the sort of values in GkSA inside each 
range corresponding to one P^^-factor is performed with 
the quicksort algorithm. A linear time construction of 
the array GklFA is possible by using an LCP array (array 
storing the length of the Longest Common Prefixes 
between two consecutive suffixes in the lexicographic 
order). However, building this array would need at least 
9mq bytes with Manzini's algorithm [24]. 

We implemented two versions of the Gk arrays: one 
which indexes only fixed-length reads, and another for 
variable-length reads. When not stated otherwise, Gk 
arrays refers to the implementation for fixed-length 
reads. For managing variable-length reads we used Sux 
http://sux.dsi.unimi.it/, an implementation of bit vectors 
with rank and select operations. 

Theoretical and experimental comparisons 

The sequencing capacity of new technologies continues 
to improve. Managing ever increasing read collections 
will be a major bottleneck in the bioinformatic analysis 
of High Throughput Sequencing data. The Gk arrays 
implement one solution to read indexing. If plain, as 
well as compressed, indexing data structures have been 
described in the litterature (cf. Introduction), their abil- 
ity to handle large read collections have not been inves- 
tigated. As we seek to optimise in practice the memory 
consumption, the construction time, and query running 
time, we will compare Gk arrays to two other uncom- 
pressed indexes: a generalized SA (gSA) and hash tables. 
We choose these two alternatives for they represent dif- 
ferent approaches to read indexing. Among the uncom- 
pressed text indexes that have been generalized to 
handle a set of texts, the gSA is reckoned to be one of 
the most memory efficient and has been preferred to 
hash tables or the suffix tree in other contexts [9,25]. 
On the other side, the optimisation of web search 
engines have triggered recent development of highly effi- 
cient hash tables, like Google sparse hash http://code. 
google.com/p/google-sparsehash or the hash tables from 
SGI extension of the C++ Standard Library http://www. 
sgi.com/tech/stl. It is thus instructive to also compare 
Gk arrays to state of the art hash tables. As explained in 
Introduction, compressed indexes save memory but 
induce much longer running times to answer queries 
compared to plain indexes, and have been excluded 
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from this comparison. Nevertheless, designing efficient 
compressed read indexes is a challenging future research 
avenue, which could be addressed by compressing the 
Gk arrays. 

A generalized Suffix Array (gSA) solution 

We detail here the solution based on a generalized Suf- 
fix Array (gSA) to index a collection of reads, all reads 
having the same length. We call it the gSA solution. In 
fact it indexes the string made of the concatenation of 
all reads, Q^. The preprocessing consists in building the 
generalized Suffix Array (gSA), the Inverse Suffix Array 
(ISA), and the Longest Common Prefixes (LCP) array of 
Cr, The gSA is built using the same algorithm than for 
Gk arrays (libdivsufsort). The ISA is built by scanning 
the gSA in mq time, while the LCP array is also con- 
structed in linear time using an efficient algorithm [26]. 
The tables are built in this order and add up in term of 
memory footprint. 

In Figure 3(a) and 3(b), we compare the time and 
space complexities of gSA and Gk arrays solutions. 
Since both start by building gSA{CR) and this is the 



dominant term of the time complexity, we obtain 0{mq) 
time complexity: the space occupied during the con- 
struction of that table alone is 4,02mq, while it amounts 
to 4!mq once built [27]. The last three columns of these 
tables show how the cumulated memory footprint 
evolves after each step during construction. We also 
monitored the memory footprint evolution during the 
construction of gSA and of Gk arrays and illustrate 
these graphically in Figures 4(a) and 4(b), respectively. 
For the gSA the three tables add up in memory and 
each takes 4mq space. With Gk arrays 

1. the GkSA table replaces gSA{C]^) in memory and 
takes only 4mq, 

2. GklFA takes an additional Amq while GkCFA 
occupies 4(f + 1) with f denoting the number of dis- 
tinct P/^-factors, and 

3. finally the GkCFPS replaces GkCFA and takes 
exactly the same space. 

In total, gSA takes 12mq bytes of memory, while Gk 
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Figure 3 Comparing the complexities of the Gk arrays and generalized Suffix Array based solutions. Comparing the complexities of Gk 
arrays and of tine generalized Suffix Array solutions. A complexity is an expression that evaluates the running time or memory usage in function 
of parameters describing the input size. The construction time and space complexities of the index for q reads of length m having f distinct k- 
mers are given for the generalized SA in (a), and for the Gk arrays in (b). We detail the cumulative space complexity during the construction of 
the gSA, and after the main steps of the construction algorithms. I.e.: once the gSA, the ISA, and the LCP arrays are built in (a), and once GkSA, 
GklFA, and GkCFPS are built in (b). In (c) we give the time complexities for answering queries Q1-Q7 with a /c-mer denoted by f. The procedures 
for the gSA depends on occ_Cf^(f), the occurrence number of fin the text made by the concatenation of all reads {I.e. in C/^), while those for the 
Gk arrays depends on occ_Reods(f), the occurrence number of fin all reads, and we know that occ_Reads(f) < occ_Cf^(f). 
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arrays occupy 8mq + 4(f +1) bytes (with 32-bit integers), 
and m:=m — is smaller than m. This explains why 
the memory footprint of Gk arrays remains smaller in 
practice than that of gSA (Figures 4(a) and 4(b)), even 
for varying k values (see Figures 5(a), 4(a) and 4(b)). 
Indeed, the gain of memory provided by Gk arrays 
increases with both k and q. If k is small, each /c-mer 
tends to occur more in average, and thus 
4(f + 1) 4m^/, meaning that GkCFPS is much smaller 
than the LCP array. If k is large then 4(m-/:+l)<5^ « ^mq 
and thus, GkSA plus GklFA tables occupy much less 
place than the gSA and ISA tables. This constitutes, in 
almost all cases, a saving of at least 12{k - l)q bytes. 

Locating a /c-mer in the reads can be done with a bin- 
ary search in 0{k + log qm) worst case time with gSA 
using SA and LCP arrays and 0{k x \ogqm) worst case 
time with Gk arrays using GkSA. (We recall that the 
binary search is the standard procedure in this context 
[8,9]). 

However, Manber and Myers [8] mentioned that a 
simple improvement over the classical binary search 
(namely remembering the minimum length between the 
longest common prefix of the left and middle elements 
and the longest common prefix of the right and middle 
elements at each step of the binary search) permits to 
run in practice as fast as a 0{k + \ogqm) worst case 
method (see also [9] Section 7.14.3 page 152). 

Thus, starting from a /c-mer, rather than from a posi- 
tion, when answering the queries will bring an overhead 
similar in practice for the gSA and Gk arrays. 

Algorithm 5: Ql {Indj^if)) with the generalised Suffix 
Array solution 

Data:/G y!\ j g Ppos such that CrU .. ; + /c - 1] =/ 

Result: The set Indjjj) 

1 begin 



2 Indj^ <r- empty set; 

3 Initialize the whole bit vector, D, to zero; 

4 / <r- /5A [/];// Starting position of /occur- 
rences in SA 

5 repeat 

6 if [SA[i] mod m) <m then 

//the occurrence position does not 
overlap two reads 

7 readlndex ^ [SA[i]/mJ; 

8 if D[readlndex\ ^ 1 then 

//we have not found an occurrence 
in this read yet 

9 Add readlndex to Indk) 

10 D[readlndex] ^ 1; 

11 i <^ i 1; 

12 until (/ > qm) or {LCP[SA[il SA[i + 1]] < k); 

13 return {Ind/^); 

Nevertheless although we consider the same input, a 
position ; of occurrence of the /c-mer in a read, answer- 
ing queries differ between the Gk arrays and gSA solu- 
tions. Indeed, since the gSA stores all positions in C/^, 
we need to filter out positions of /c-mers that overlap 
two reads in Cj^ to keep only P-positions. This adds 
instructions to the procedure compared to that for the 
Gk arrays: see line 6 in Algorithm 5, which gives the 
algorithm for query Ql with the gSA. For answering 
queries Ql and Q2, we must perform another slight 
modification: we use a binary mask for dealing with 
duplicate /c-mers in a same read. This mask is stored in 
a binary vector B having q bits, one bit per read. The bit 
corresponding to a read is set to one whenever the k- 
mer has been found to occur in that read, and subse- 
quent occurrence positions in that read will be filtered 
out if the corresponding bit is set (lines 8 and 10 in 
Algorithm 5). 
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Figure 4 Evolution of memory footprint during the construction of the generalized Suffix Array and of Gk arrays. Evolution of memory 
footprint during the construction of the generalized Suffix Array (a) and of Gk arrays (b) when indexing 15 million 75 bp reads with k = 25. 
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Figure 5 Memory and construction time comparison between the Suffix Array solution, the hash table and the Gk arrays. Memory and 
construction time comparison between tine generalized Suffix Array solution (gSA), the hash tables (HT) and the Gk arrays. K562 dataset is used 
for that experiment, with 5 million to 25 million reads. The length of /c-mers ranges from 15 to 30. gSA plots have been shifted left and HT plots 
have been shifted right for easing the reading, (a) Maximal memory usage while constructing the index and querying it. The error bars represent 
the space consumption depending on the value of k. (b) Construction time for the three indexes on the same data as for the maximal memory 
consumption. The levels of gray on the plots represent the value of k. 



Assume we query on a /c-mer / from one of its occur- 
rence position Let us denote by occ_C]^(f) the number 
of occurrences of / in Cj^, including those overlapping 
two reads {Le,, starting at non P-positions), and by occ_- 
Reads(f) the number of its occurrences that are totally 
included in a read {i.e,, those starting at P-positions). 
For Q1/Q2, Q5-Q7, we obtain with gSA a complexity of 
0{q + occ_C]^(f)) since one initializes the bit vector B of 
size q and scan all occ_C]^(f) occurrences. While with Gk 
arrays, the complexity depends linearly on occ_Reads(f) 
and we know that occ_Reads(f) < occ_C]^(f), 

For Q3/Q4, there is no need of a bit vector with the 
gSA method, hence their complexity is 0{occ_Cj^(f)), for 
one needs to scan positions in the gSA using the ISA 
and the LCP arrays. However, Gk arrays offer a com- 
plexity of 0{occ_Reads(f)) for Q3 and 0(1) for Q4. We 
summarize all queries time complexities in Figure 3(c). 

Remark 3. To avoid scanning occ_Cj^(f) entries, an 
alternative solution consists in delimiting reads inside Cr 
using a separator. This solution would lead to a space 
overhead of q bytes for lowering the time complexity to 
occ_Reads(f). However we did not retain this solution 
since our goal is to diminish the space complexity and 
this solution would not improve much the time 
complexity, 

A solution based on a hash table 

An alternative solution is to index all /c-mers in a hash 
table and to store for each read the list of its occurrence 
positions in the read collection. This list will contain 
pairs of integers: the read index in the collection, and 
the starting position of the /c-mer in that read. The read 
index can be stored on a 32-bit integer, while a 16-bit 



integer suffices for the starting position. In such a case, 
storing the text is not necessary. The number of entries 
is the number of distinct /c-mers in the read collection, 
i.e. our parameter f. Generally, f is small compared to 
4^^ for values of k in [15,60]. Hence the hash table will 
be sparsely populated. We tried several implementation 
of state of the art hash tables: the Google sparse and 
dense hash arrays, and that from SGI extension of the C 
++ Standard Library (called hash map). 

Preliminary experiments have shown that Google sparse 
requires significantly much longer to build than SGI hash 
map, while having a lower memory footprint. With 20 mil- 
lion 75 bp reads, Google sparse hash occupies one third of 
the memory needed by the SGI hash map, but it takes 
thrice more time to build. On the contrary Google dense 
hash tables takes twice more memory, and offers only 
similar construction time. Hence, SGI extension hash map 
exhibited the best compromise in term of memory con- 
sumption and construction time compared to Google 
implementations. Thus, we choose SGI extension imple- 
mentation for the comparison with Gk arrays. 
Experimental settings 

We tested index structures on three datasets. 

1. We used a collection of 40 million lUumina® 
RNA-Seq reads of length 75 from a human K562 
library taken from the RGASP data (Accession num- 
ber GM12 87 8 at http://www.gencodegenes.org/rgasp 
with permission from B. Wold). We call it the K562 
dataset. 

2. We compiled several lanes of Roche 454® geno- 
mic sequencing to obtain a collection of 2.8 million 
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reads ranging from [40,3000] bp with an average 
read length of 523 bp. These were sequenced on a 
Roche 454® GS FLX platform with Titanium chem- 
istry for the Khoisan genome project [28]. We call it 
the Khoisan dataset. 

3. As much longer fixed length reads are not yet 
available, we constructed a collection of fixed length 
reads by slicing the Khoisan reads in non-overlap- 
ping pieces of 150 bp. We obtained 25 millions of 
150 bp reads, a read length that will soon be gener- 
ated on High Throughput Sequencing platforms. 

In the first and third collections, reads have a fixed 
length, while in the second their length varies. The 
experiments were performed on an Intel Xeon 2.27 GHz 
equipped with 48 GB of main memory, and running 
Linux 2.6.18 with C++ compiled using gcc version 3.4.6 
and -02 -funroll- loops options. 
Experimental comparison 

The use of read indexing raises three questions: how 
much computing resources does the index demand? Is 
it scalable? How fast can it answer large number of 
queries? Clearly the resources will depend on the num- 
ber of reads (parameter q), their lengths and on the 
length of /c-mers (parameter /c). We compare three solu- 
tions: a hash table (HT), a generalized Suffix Array 
(gSA), and Gk arrays. 

Scalability We measured the construction time and 
amount of memory taken by all solutions for various 
numbers of reads and /c-mer lengths. Figure 5(a) plots 
the maximal memory footprint on K562 data. At this 
scale, the value of k impacts only the hash table size; its 
influence on the gSA and Gk arrays is not visible on 
that graph. Second, the solutions can be ordered as fol- 
lows: Gk arrays take the less memory, followed by the 
gSA, and then the hash table. This order is irrespective 
of the read number. For k = 20 e.g,, Gk arrays use 10 
GB, the gSA uses 20, and the HT 44, and the curves 
clearly indicate that these differences increase with the 
number of reads. Whatever the value of q, the hash 
table requires twice as much memory as the gSA, which 
itself takes at least 70% more memory than Gk arrays. 
With 25 million reads the hash table saturates the mem- 
ory, with 30 million the gSA also does, while the Gk 
arrays constitute the only solution able to index the 
whole collection, 40 million reads, on that computer. 
Note that in both cases, the 64-bit implementation of 
gSA and Gk arrays have to be used to index that 
amount of reads. For the whole read collection, Gk 
arrays needs at most 43 GB {k =15) and at least 36 GB 
{k = 30). 

For all solutions, construction times increase linearly 
with the number of reads as expected (Figure 5(b)). It 
remains very similar between the gSA and Gk arrays. 



which both takes e.g. <1000 s. for 25 million reads. The 
influence of k is clearly visible on the hash table for 20 
million reads: its construction time decreases with k 
because the parameter f also does (for a given number 
of reads). As long as they fit in memory, all compared 
solutions offer practical construction times. 

We examined the behavior of Gk arrays on much 
longer reads, 150 bp, when variable-length read option 
is activated and when it is not. Figure 6(a) plots space 
consumption, while Figure 6(b) records the construction 
time for both options. 

We see that adding a bit vector is not space consum- 
ing since there is little difference between the two meth- 
ods (Figure 6(a)). For 13 million reads, the difference is, 
at most, of 300 MB between the two methods. In Figure 
6(b), we plotted the construction time for both indexes. 
The variable length read implementation becomes 
slower when the number of reads grows, compared to 
the fixed length Gk arrays. This shows that despite a 
constant-time theoretical complexity for rank and select 
operations; there is a dependency on the length of the 
bit vector in practice. However, the construction time 
remains reasonable in the variable case. 

Figures 7(a) and 7(b) plot space and time measured 
for the hash table and Gk arrays (with variable length 
reads option) on the Khoisan read collection. The gSA 
has not been implemented to handle variable length 
reads; note that the relative cost would have been simi- 
lar to that observed with Gk arrays between fixed and 
variable read length options. Here for one million reads, 
variable length Gk arrays require 470 s. to build vs 428 
s. for the hash table, but 8 times less memory (5.5 vs 46 
GB). The difference increases strongly with the read 
number. Above one million reads, the memory footprint 
of the hash table exceeds the computer memory (which 
is 48 GB), while Gk arrays index the complete collection 
of 2.8 million reads on the same hardware with <15.6 
GB. Hash tables appear to be more space consuming on 
the Khoisan dataset than on the K562 dataset. This can 
be explained by the nature of the data. Roche 454® 
sequencers offer a coverage depth much lower than Illu- 
mina's. Hence the number of distinct /c-factors in the 
reads is likely to be greater with the Khoisan dataset. 
Answering queries We measured the mean time needed 
to answer 100,000 random queries of Q1-Q4. Since Q5- 
Q7 are slight variations of Q1-Q3 we do not report on 
these queries. 

Figure 8 shows how the mean time for each solution 
vary with the number of indexed reads [q) and k on the 
K562 collection. Clearly, the influence of q is similar for 
all solutions, and small compared to the differences 
between solutions. Generally, gSA takes always longer 
than the hash table irrespective of the query type, and it 
also takes longer than Gk arrays for Q1-Q2 and Q4, and 
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Figure 6 Comparing Gk arrays with fixed and variable length reads. Experiments on the sliced Khoisan dataset (150 bp-reads) with Gk 
arrays (fixed-length reads and variable-length reads), (a) Maximal space consumption of each index for several read numbers and values of k 
(the error bars represent the variation in space usage, depending on k). (b) Index construction time (the lighter gray corresponds to the smaller 
k). (c) Query computation time for 13,000,000 reads depending on different k values. 



a similar time for Q3. The order between the hash table 
and Gk arrays depends on the query type. They are 
equally fast on Ql, the hash table does slightly better on 
Q2, clearly better on Q3, while Gk arrays is much faster 
on Q4. Anyway, for both the hash table and Gk arrays, 
the mean running time is in the order of or less than 10 
microseconds for Q1-Q3, and around 0.1 microsecond 
for Q4 with the Gk arrays, meaning reasonable practical 
times. 

For our comparison of Gk arrays with fixed or variable 
length read options, we see that the latter is becoming 
slower than the former (up to 7 times slower) when k is 
small, i.e. when the number of occurrences of /c-mers is 
large. With larger /c, the query time of the latter 
diminishes and becomes 2 to 3 times slower than with 
fixed Gk arrays. 

With variable length reads (Figure 7(c)) the query 
times remain practical, but the hash table needs between 
1 and 32 fold less time than Gk arrays depending on the 
query. 

In summary, under various conditions Gk arrays are 
equivalent in construction time to a generalized Suffix 
Array or to a hash table. Compared to these solutions. 



they also offer reasonable query times under all circum- 
stances; however, Gk arrays clearly outperform them in 
terms of memory footprint, the main bottleneck for pro- 
cessing High Throughput Sequencing data. 

Conclusions 

As High Throughput Sequencing becomes widespread, 
computational biology will face the challenge of mana- 
ging astronomical quantities of short sequences. Mining 
such amount of sequences is feasible if the sequences 
are indexed in a preprocessing step. An index is a data 
structure that, like a telephone book, enables one to find 
easily a piece of information. For some value /c, it 
records the positions of all /c-mers in the reads in an 
organized fashion to minimize the memory usage. Then 
finding the reads related to some /c-mer takes as long as 
reading the /c-mer and listing the corresponding reads, 
but not as long as scanning all the reads. In other 
words, read indexing factorizes the results of searches, 
which later speeds up the numerous queries made while 
the index is kept in memory. Our main contribution is 
to propose such an index: the Gk arrays. They are fast 
to build, require less space than alternative 
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Figure 7 Comparing hash table with Gk arrays on variable-length reads. Experiments on the Khoisan Roche 454® dataset (variable length 
reads) with a hash table and Gk arrays (with variable-length read option), (a) Maximal space consumption of each index for several read 
numbers and values of k (the error bars represent the variation in space usage, depending on k). The lower space consumption corresponds to 
a lower k. (b) Index construction time (the lighter gray corresponds to the smaller k). (c) Query computation time for 600,000 reads depending 
on the value of k. 



uncompressed solutions, and can thus handle larger 
read collections: 40 million vs 20 million reads for the 
hash tables with a memory limited to 48 GB. It is a key 
issue in practice. 

While being comparable to hash tables in terms of 
time efficiency, only the Gk arrays can completely index 
a large read collection (like the K562 dataset) with a 
memory size available on nowadays computing servers. 
Moreover, our index remains fast for a wide range of 
values of parameter k (the length of /c-mers). We have 
also shown that Gk arrays are both faster and smaller 
than an alternative generalized Suffix Array approach. 
Similarly, on variable-length reads like a Roche 454® 
dataset, Gk arrays can handle the whole read collection 
using less than 16 GB while hash tables are limited to a 
smaller sub-collection (about 1 million reads) on a 48 
GB machine. 

The Gk arrays answer efficiently different types of 
queries, but they have been optimised for queries where 
the searched /c-mer is extracted from an indexed read. 
Sometimes one wishes to know for a given /c-mer the 
reads in which it occurs and its positions inside those 



{e.g. assembly), while in other contexts one only wants 
the number of reads sharing this /c-mer {e.g. estimation 
of expression level). Moreover, Gk arrays adapt well to 
variable length reads. Their scalability and versatility are 
key advantages, which allows to envisage multiple appli- 
cations as mentioned in Introduction. However, scaling 
up to gigantic datasets (terabytes of data), as the ones 
obtained in large metagenomic projects, will require 
compressed read indexes. The simplicity of use of our 
index, and its implementation as a C++ library make it 
a software brick that can be easily exploited in future 
programs or further developed by the community. 

For mapping reads on a reference sequence, solutions 
exist that index reads with hash tables [6,29]. For the 
error correction problem, other works have indexed 
reads with classical text indexing solutions: with a gen- 
eralized suffix trie [15,30], a suffix array [31], or hash 
tables [32]. Gk arrays represent a first, attractive read 
indexing solution; it is specialised for this question and 
should suit different applications. Nevertheless, one can 
envisage several research perspectives. Indexing approxi- 
mate /c-mers or spaced seeds will authorize more types 
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Figure 8 Queries' running time comparison between the Suffix Array solution, the hash table and the Gk arrays. Queries' running time 
comparison between the Suffix Array solution, the hash table and the Gk arrays. Answering the queries Q1/Q2/Q3/Q4 on K562 dataset (75 bp 
reads). The plots represent the average time in ^s over the same 100,000 queries of the corresponding type {i.e., Ql, Q2, Q3, or Q4). In all cases, 
the running time decreases when k increases for there are less occurrences in the reads of a say 30-mer than of a 15-mer. The Gk arrays are 
always faster than the Suffix Array; they even compute Q4 in constant time. 



of queries, but will certainly increase the construction 
time and space requirements. Designing a dynamic con- 
struction algorithm for Gk arrays would futher enlarge 
their range of applications. Another challenge is to com- 
press Gk arrays by storing sampled positions and recom- 
puting other positions at run time, as done with the 
Burrows Wheeler transform [5]. This would enable the 
user to adapt the index to its computer memory, while 
sacrificing some of its performance. 

Additional material 



Additional File 1: Proof and queries' algorithms. 
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